ABSTRACT

Along with the advent of MOOCs and other online learning 
platforms such as Khan Academy, the role of online education
has continued to grow in relation to that of traditional
on-campus instruction. Rather than tackle the problem 
of evaluating large educational units such as entire online 
courses, this paper approaches a smaller problem: exploring 
a framework for evaluating more granular educational units, 
in this case, short educational videos. We have chosen to 
leverage an adaptation of traditional Bayesian Knowledge 
Tracing (BKT), intended to incorporate the usage of video 
content in addition to assessment activity. By exploring 
the change in predictive error when alternately including or 
omitting video activity, we suggest a metric for determining
the relevance of videos to associated assessments. To
validate our hypothesis and demonstrate the application of 
our proposed methods we use data obtained from both the 
popular Khan Academy website and two MOOCs offered by 
Stanford University in the summer of 2014. 
1. INTRODUCTION

As the relative importance of MOOCs and other online learning
platforms such as Khan Academy has increased, so has
the importance of verifiably sound online pedagogy.
While many of the lessons learned through a long history
of research on the traditional classroom are applicable
to the online environment, many indicators available during
traditional instruction are not present for a designer of
online material. To address the need for scalable and
reproducible evaluation, we hypothesize that by relating the
use of video materials to performance on subsequent assessment
items, we can construct a metric to evaluate the relevance
of those videos, without needing to resort to comparative
studies.

To model student interactions with educational material and
improvement over time, we have chosen to use an adaptation
of Bayesian Knowledge Tracing (BKT), a technique developed
for and used with Intelligent Tutoring Systems (ITS),
but which has been applied outside of that domain as well.
We seek to incorporate behavior, such as video observation,
which falls beyond the purview of attempting assessment
items. We contrast this extended model with a simpler one
excluding resource usage in order to discover whether videos
contribute to model accuracy, and whether some models benefit
more than others.

Our ultimate goal is not to produce high predictive accuracy
for the purposes of predicting students’ latent knowledge,
but rather to provide a quantitative framework for evaluating
video resources. We set out first to show that there is
a reduction of predictive error when incorporating video
resources into BKT analysis, in order to validate the inclusion
of such observations. Second, we propose a metric based on
a combination of both the delta in error between models using
and eschewing video data and the learn rate associated
with a particular video, in order to foreground both those
videos which appear most relevant and those which may need
attention.

2. RELATED WORK 

2.1 Bayesian Knowledge Tracing 

Bayesian Knowledge Tracing [1] is used extensively in
computer-assisted instruction environments, intended to
approximate mastery learning. The model in its most basic form
is defined by four parameters: P(L0), the prior probability that
a student has mastered a particular KC, or knowledge component;
P(S), the probability that a student who knows a concept
will get an associated question wrong, or ‘slip’; P(G),
the probability that a student who does not know a concept
will correctly ‘guess’ the answer; and P(T), the
probability that a student who does not know a particular
KC will learn it after a given observation. Through a process
of Bayesian inference, an observed correct or incorrect
response to an assessment item can be used to calculate a
posterior probability that a student has mastered the KC.
Using this posterior and P(T) as described above, a new
prior is calculated, accounting for the probability that the
KC was learned between observations. This process is then
repeated, using the updated estimate, for each subsequent
observation.
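
The update described above can be illustrated with a short
sketch. The following Python is a minimal illustration of the
standard BKT step, not the implementation used in this paper;
the function and argument names (bkt_update, predict_correct,
p_know, and so on) are assumptions introduced here.

```python
def predict_correct(p_know, p_slip, p_guess):
    """Probability of a correct response given the current mastery estimate."""
    return p_know * (1 - p_slip) + (1 - p_know) * p_guess


def bkt_update(p_know, correct, p_slip, p_guess, p_transit):
    """One BKT step: Bayesian posterior on mastery, then the learning transition.

    p_know    -- current prior that the KC is mastered (starts at P(L0))
    correct   -- True if the observed response was correct
    p_slip    -- P(S), p_guess -- P(G), p_transit -- P(T)
    """
    if correct:
        numerator = p_know * (1 - p_slip)
        denominator = numerator + (1 - p_know) * p_guess
    else:
        numerator = p_know * p_slip
        denominator = numerator + (1 - p_know) * (1 - p_guess)
    posterior = numerator / denominator
    # New prior: either already mastered, or learned between observations.
    return posterior + (1 - posterior) * p_transit


# Trace a short response sequence for one student on one KC.
p = 0.5
for response in [True, False, True]:
    p = bkt_update(p, response, p_slip=0.1, p_guess=0.2, p_transit=0.1)
```

Each call consumes one observation and returns the prior for
the next, so a student's history is traced by folding the
function over the response sequence.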

We chose to use BKT as a modeling framework as it is
well studied and possesses relatively well understood
properties, with parameters which are intuitively interpretable
and therefore potentially actionable. Additional work has
been done to extend this basic model of BKT to incorporate
individualized parameters, based on factors depending
both upon individual student properties (see e.g. [7], [2]),
as well as properties of particular assessment items within a
knowledge component [8].
Source        Total Events    Distinct KCs
Khan          353,202         176
Economics     689,709         94
Statistics    337,428         70

Table 1: Properties of the three sources


2.2 Online Course Resources 

There has been a fair amount of research devoted to studying
the efficacy of videos, forums, and other study aids offered in
online educational contexts. Past work has typically focused
on issues such as student attrition, student interaction, and
building student-facing recommender systems. For example,
Yang et al. described a framework for helping students sift
through the large volume of forum discussion posts in
order to find content relevant to them [10]. Similar efforts
have been made to provide recommendations for more general
content, using methods such as social media analysis
and reinforcement learning [5] [9].

Relative to the research on student perception and experience
in the MOOC context, little attention has been paid to
that of the instructor. That is not to say that such work has
been absent. Guo et al. [3] and Kim et al. [4] offer guidance
for the construction of videos used in MOOCs. Explorations
of the application of Item Response Theory in a MOOC
environment [6] similarly offer instructors guidance in evaluating
the efficacy of their assessments using traditional methods.
Yousef et al. construct an inventory of features, pedagogical
and technological, which contribute to a sense of course
quality [11]. Yet there remains a relative paucity of
research on the quantitative assessment of content outside of
the scope of assessment items.

3. DATA 

In order to demonstrate the generalizability of our results,
we leveraged three sources of event log data. Two of our
datasets were taken from Stanford Online courses run using
the edX platform: ‘Statistics and Medicine’ and ‘Principles
of Economics.’ The third was taken from the popular Khan
Academy website. See Table 1 for details.

The data we obtained from Khan Academy contains observation
events collected over about two years, from June 2012
to February 2014, while both edX courses were offered from
June to September of 2014. Assessment items in Khan are
categorized hierarchically as part of a larger ‘exercise’
representing a particular skill, and further as a member of a
‘problem type,’ describing the template used to generate a
specific problem, while exercises from edX are categorized
as individual problems. For the sake of simplicity we have
chosen to consider each exercise as a separate knowledge
component (KC) for the purposes of training BKT models.

Figure 1: The Template Videos Model

For both the Khan and edX data, there was no immediately
available canonical mapping between videos and
associated problems. By scanning the logs of learner activity
and using a metric combining chronological proximity
of use with frequency of associated observation, we
produced a mapping between videos and their related KCs.
Because our goal was not to produce a generative procedure
for semantically associating log events, we chose a method
that was sufficiently successful without introducing unnecessary
complexity. However, this does introduce possible sources of
error in terms of both overlooked and spuriously constructed
mappings.
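
The exact mapping metric is not specified here, so the
following Python is only a hypothetical sketch of one way to
combine chronological proximity with frequency of association.
The event layout, the function name map_videos_to_kcs, and the
window and min_count thresholds are all assumptions introduced
for illustration, not values used in this work.

```python
from collections import Counter, defaultdict


def map_videos_to_kcs(events, window=600, min_count=2):
    """Map each video to the KC most often attempted soon after it is watched.

    events    -- iterable of (user, timestamp_seconds, kind, item_id),
                 where kind is "video" or "problem" (hypothetical layout)
    window    -- proximity threshold in seconds (assumed value)
    min_count -- minimum number of co-occurrences to keep a mapping (assumed)
    """
    by_user = defaultdict(list)
    for user, ts, kind, item in events:
        by_user[user].append((ts, kind, item))

    counts = defaultdict(Counter)  # video id -> Counter of KC ids
    for log in by_user.values():
        log.sort()  # chronological order within one learner
        for i, (ts, kind, item) in enumerate(log):
            if kind != "video":
                continue
            for later_ts, later_kind, later_item in log[i + 1:]:
                if later_ts - ts > window:
                    break  # too far from the video to count as related
                if later_kind == "problem":
                    counts[item][later_item] += 1

    mapping = {}
    for video, kc_counts in counts.items():
        kc, n = kc_counts.most_common(1)[0]
        if n >= min_count:  # drop rarely observed, likely spurious pairs
            mapping[video] = kc
    return mapping


example_events = [
    ("u1", 0, "video", "v1"), ("u1", 60, "problem", "kcA"),
    ("u2", 0, "video", "v1"), ("u2", 30, "problem", "kcA"),
    ("u2", 5000, "problem", "kcB"),
]
example_mapping = map_videos_to_kcs(example_events)
```

A short window favors precision and risks overlooked mappings,
while a long window admits more spurious ones, which mirrors
the two sources of error noted above.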

4. METHODS 

Though Section 2.1 describes the fundamentals of
Bayesian Knowledge Tracing, we employ several extensions
to the model. First, and for all models used in evaluation,
we condition P(G) and P(S) for each observation on which
specific problem template is observed, to model varying
template difficulty. We will refer to this model as ‘Standard
BKT’.

Second, we similarly condition the transition probability
P(T) on the observed problem template, generating a second
distinct but still video-free ‘Template’ model. We include
this model for the Khan data for the sake of completeness,
but note that there is only a single template for each edX
problem in the data and thus the results of this extension
are omitted for both the ‘Statistics and Medicine’ and
‘Principles of Economics’ cases.

Third, we extend our model to incorporate video observations,
conditioning P(T) either on the specific template
observed or the specific video, generating the ‘Template Videos’
model. The presence of a video observation functions similarly
to that of a problem attempt, save that as there is no
associated student response to be considered, a video is
associated only with a unique P(T). We simplify the ‘Template
Videos’ model into a fourth, ‘Template 1 Video’ model, conditioning
P(T) only on the presence of either a video or a question,
but not the specific identity of the resource observed.
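
To make the difference between the video-aware variants
concrete, the following Python sketch traces one student's
events for one KC. It is an illustration under assumptions,
not the code used in this work: the event tuples, the
dictionary-based parameters, and the name trace_with_videos
are hypothetical. Problem attempts trigger both the Bayesian
evidence step and a transition, while videos trigger only a
transition.

```python
def trace_with_videos(events, p_init, slip, guess, transit,
                      transit_key=lambda kind, rid: (kind, rid)):
    """Return predicted P(correct) for each problem attempt and final mastery.

    events      -- list of (kind, resource_id, correct); kind is "problem"
                   or "video", and correct is None for videos
    slip, guess -- dicts keyed by problem template id
    transit     -- dict of P(T) values keyed by transit_key(kind, resource_id)
    transit_key -- per-resource keys give a 'Template Videos'-style model;
                   keying on kind alone gives a 'Template 1 Video'-style model
    """
    p_know = p_init
    predictions = []
    for kind, rid, correct in events:
        if kind == "problem":
            s, g = slip[rid], guess[rid]
            predictions.append(p_know * (1 - s) + (1 - p_know) * g)
            if correct:
                num = p_know * (1 - s)
                den = num + (1 - p_know) * g
            else:
                num = p_know * s
                den = num + (1 - p_know) * (1 - g)
            p_know = num / den
        # Videos carry no response, so only the learning transition applies.
        p_t = transit[transit_key(kind, rid)]
        p_know = p_know + (1 - p_know) * p_t
    return predictions, p_know


events = [("video", "v1", None), ("problem", "t1", True)]
slip, guess = {"t1": 0.1}, {"t1": 0.2}

# Separate P(T) for every template and every video.
per_resource = {("video", "v1"): 0.2, ("problem", "t1"): 0.1}
preds_full, _ = trace_with_videos(events, 0.5, slip, guess, per_resource)

# One shared P(T) for all videos and one for all questions.
per_kind = {"video": 0.2, "problem": 0.1}
preds_simple, _ = trace_with_videos(events, 0.5, slip, guess, per_kind,
                                    transit_key=lambda kind, rid: kind)
```

Dropping video events from the input and keying P(T) on the
template alone recovers a video-free model in the same code
path.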

All models were trained and evaluated using 5-fold cross
validation. For each model above, one BKT model was trained
for each of the knowledge components. For each model type and
each fold, each of the KC models was randomly initialized
and trained using the Expectation Maximization (EM)
algorithm to maximize the log likelihood of the observed events.
This was repeated 25 times, with the maximally likely resulting
model chosen for that combination of model type, fold, and KC.
The metric used to compare the four models is the root mean
squared error (RMSE) taken across all five folds.
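
The selection and scoring procedure can be sketched as
follows. This is a minimal illustration, not the training
code used here: fit_once is a hypothetical callable standing
in for one randomly initialized EM run, and pooling the
predictions from all folds before computing RMSE is one
possible reading of "taken across all five folds".

```python
import math
import random


def rmse(predictions, outcomes):
    """Root mean squared error between predicted P(correct) and 0/1 outcomes."""
    squared = [(p - o) ** 2 for p, o in zip(predictions, outcomes)]
    return math.sqrt(sum(squared) / len(squared))


def pooled_rmse(folds):
    """RMSE over held-out predictions pooled from every fold.

    folds -- list of (predictions, outcomes) pairs, one per fold
    """
    predictions = [p for fold_preds, _ in folds for p in fold_preds]
    outcomes = [o for _, fold_outs in folds for o in fold_outs]
    return rmse(predictions, outcomes)


def best_of_restarts(fit_once, n_restarts=25, seed=0):
    """Run a randomly initialized fit several times; keep the most likely model.

    fit_once -- hypothetical callable taking a random.Random instance and
                returning (model, log_likelihood) for one EM run
    """
    rng = random.Random(seed)
    best_model, best_ll = None, -math.inf
    for _ in range(n_restarts):
        model, log_likelihood = fit_once(rng)
        if log_likelihood > best_ll:
            best_model, best_ll = model, log_likelihood
    return best_model, best_ll
```

Random restarts matter because EM converges only to a local
optimum, so the result depends on the initial parameters.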

5. RESULTS AND DISCUSSION 

Tables 2, 3, and 4 describe the results of running the three
data-sets through the analytical models. In each case, the
‘Template Videos’ and ‘Template 1 Video’ models tended to
perform best, while the ‘Template’ model, using the Khan Academy
data, showed no significant difference from the Standard BKT
baseline. The significance test is performed on the
distribution of RMSE across the KC models in each
data-set.


Model               Mean RMSE    Significance
Pct. Correct        .4930        .0000*
Standard BKT        .3824        -
Template            .3824        .9448
Template Videos     .3810        .0253*
Template 1 Video    .3811        .0061*

Table 2: Khan Academy


Model               Mean RMSE    Significance
Pct. Correct        .6243        .0000*
Standard BKT        .3824        -
Template Videos     .3715        .0000*
Template 1 Video    .3716        .0000*

Table 3: Principles of Economics

Model               Mean RMSE    Significance
Pct. Correct        .5551        .0000*
Standard BKT        .3711        -
Template Videos     .3638        .0000*
Template 1 Video    .3642        .0000*

Table 4: Statistics and Medicine


Though the tables reflect changes in RMSE aggregated over
all KC models, not all models benefited evenly from the
inclusion of video resources. Among the Khan data, 77 of 193
KCs saw more than a trivial amount of reduction in error,
while in Economics and in Statistics and Medicine, the bulk of
the improvement could be seen in 57 of the 94 and 44 of the
70 models, respectively. This asymmetry of improvement is
an expected behavior of the system. Intuitively, in the case
that a particular video resource is either not helpful or
actively harmful to a student in solving a particular problem or
set of problems, this would be reflected in the trained model
as additional noise, leaving the overall RMSE unaffected at
best.

Conversely, the presence of a statistically significant, though
perhaps small, decrease in predictive error in some models
supports the hypothesis that considering video usage
can offer useful information.

5.1 Highest and Lowest Performing Models 

In order to gain an intuition for why some models were better
described by the inclusion of resources, we chose to consider
a selection of the best and worst performers from each
data set under the ‘Template Videos’ condition. By examining
what properties might explain the performance of each
model, we seek insight into what sort of videos appear to
offer the greatest benefits to student performance.

For the highest performing models in the Khan data, the
videos appeared highly relevant to their associated exercises,
often demonstrating solutions in the Khan interface. For
example, ‘The Fundamental Theorem of Arithmetic’ explains
the manipulation of a bespoke tool created for that particular
exercise, showing the completion of a practice problem
using that tool.

For the low performing Khan models, the possible sources of
error mirror the effects seen in the high performing cases.
‘Scalar Matrix Multiplication’ and ‘Linear Inequalities’, for
example, present their problems very differently than
their related videos do and involve customized input fields,
which may have been a source of trouble.

Though the Principles of Economics and Statistics and Medicine
edX courses are formatted very differently than the lessons of
Khan Academy, the distinctions between the best and worst
models are similar. In both cases, the best videos in the
data-set, while less visually similar to their assessments than
the Khan examples, are closely related to the subsequent
assessments. Additionally, most of the associated assessments
allowed students only one attempt, which may explain the
particularly strong reduction in error when including video
information.

Perhaps most interesting is that one of the best predicted
models is the ninth question on the final exam of the
‘Statistics and Medicine’ course. The content of this question is
nearly identical to the content of a video from a couple of
weeks earlier, ‘Practice Interpreting Linear Regression
Results.’ It is therefore unsurprising to find that the video,
while not explicitly grouped with the exam, is associated
with a very strong learn parameter; students who sought
out the video succeeded significantly more often on the
assessment.

Two of the videos related to the worst models in the
Economics set, ‘The Spending Allocation Model’ and ‘The Fed
and the Money Supply’, are both relatively long, each over
fifteen minutes. Despite its length, each video dwells only
briefly on the subject concerned in the assessment, spending
most of its running time on other topics, with the pertinent
sections easy to skip or miss. Another worst performer
is one of the first videos in the course, associated with a quiz
with nearly a 90% correctness rate.

Intuitively, an unhelpful video does not contribute to a
predictive model, simply adding complexity and noise.
By measuring which videos do and do not contribute
constructively to predictive accuracy, it may be possible to
detect which videos might be most appropriately suggested as
helpful for a learner, and which need revision. In particular,
such results could be useful to an instructor or course manager
in navigating what to improve and what to keep when
iterating on a course between offerings.

6. CONCLUSIONS AND FUTURE WORK 

Figure 2: Videos from Khan Academy contributing
maximally to model accuracy tended to closely mirror
subsequent assessments

In this paper, we have demonstrated that the inclusion of
video observations in a BKT model can offer information
relevant to predicting student behavior, not only in one
data-set, but generalizably across multiple domains. Though the
effect size is small, the statistically significant decrease in
error under the ‘Template 1 Video’ and ‘Template Videos’
conditions across the three data-sets considered is an
encouraging sign. It indicates that there is information to
be gleaned from a learner’s use of video resources. Further,
as suggested by our investigation of some of the superlative
models, it is possible that the delta in error generated by a
given model, coupled with the associated P(T) for a video
within that model, could be a useful metric for evaluating
video relevance.
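
No formula for combining these two quantities is given here,
so the following Python is only a hypothetical sketch of how
the error delta and a video's P(T) might be reported together.
The function name video_relevance, the labels, and both
thresholds are assumptions introduced for illustration.

```python
def video_relevance(rmse_without, rmse_with, p_transit,
                    min_delta=0.001, min_learn=0.05):
    """Pair the error reduction from adding video data with the video's P(T).

    rmse_without -- RMSE of the KC model trained without video observations
    rmse_with    -- RMSE of the KC model trained with video observations
    p_transit    -- learned P(T) for the video within that model
    min_delta, min_learn -- illustrative thresholds, not values from the paper
    """
    delta = rmse_without - rmse_with  # positive means the video data helped
    if delta >= min_delta and p_transit >= min_learn:
        label = "relevant"
    elif delta < min_delta and p_transit < min_learn:
        label = "review"
    else:
        label = "mixed"
    return delta, label


# Inputs below are illustrative: the RMSE values are the aggregate Khan
# figures from Table 2, and the P(T) value is made up for the example.
delta, label = video_relevance(0.3824, 0.3810, p_transit=0.2)
```

Keeping the two quantities separate rather than collapsing
them into a single score leaves the final judgment about a
video with the instructor.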

One piece missing from this analysis is a canonical association
of videos to exercises. Though we generated and used
a set of associations, we may have lost information in the
process. Another avenue worth pursuing is the possibility
that some users would benefit strongly from video resources
while others may not. To that end, it would be useful to
examine potential reductions in error that might be made
by individualizing parameters to each KC-student pair.

An important caveat of this analysis is that our
results do not speak to a general ‘quality’ of a video, and
indeed that is perhaps beyond the scope of a quantitative
analysis. A video rated poorly by our metrics need not
necessarily be a bad video, merely unrelated or unhelpful for
a subsequent assessment task. The importance of this particular
property is a matter of educational policy, and thus
beyond the scope of this paper. Our goal is not to supplant
the role of instructor decisions in course management, only
to support them.